Instructions

The following tutorial will let you reproduce the plots that we are going to create at the workshop using R.
Please read carefully and follow the steps. Wherever you see the Code icon on the right you can click on it to see the actual code used in that section.
To run R and RStudio on Binder, click on this badge - Launch Rstudio Binder.
Additional instructions (and screenshots) on using R, RStudio and Binder are available in Workshop1 Tutorial on BB.

Analyse Data in R

Install Extra Packages

For this analysis we will again use some packages from the Tidyverse, but this time we load the specific packages (which are supposed to be pre-installed on your computer) to try and avoid having to download the entire tidyverse. Please install the packages if you’re getting error when running the library() command. In addition we will use the plotly package to create interactive plots.
To install these packages, we use the install.packages('package') command, please note that the package name need to be quoted and that we only need to perform it once (if we followed step 3b above, or every time we restart the server if we didn’t), or when we want or need to update the package. Once the package was installed, we can load its functions using the library(package) command. Note that in this case we use the package name without quotes!.

# install required packages - needed only once! (comment with a # after first use)
# install.packages(c("dplyr", "ggplot2", "readr","RColorBrewer", "Rmisc", "plotly"))
install.packages("tibble")
# load required packages
library(dplyr)
library(readr)
library(ggplot2)
library(RColorBrewer)

More information on R packages can be found in this tutorial.

Read Data

Now that we’ve got RStudio up and running and our packages installed and loaded, we can read data into R from our local computer or from web locations using dedicated functions specific to the file type (.csv, .txt, .xlsx, etc.).
We need to provide a variable name that will store the data and remember that R holds its variables in the computers RAM (which will be the limiting factor in terms of size of data that can be handled).
We will use the read_csv() command/function from the readr package (part of the tidyverse) to load the data from a file hosted on the web into a variable of type data frame (table). If we don’t want to use external packages, we can use the read.csv() function from base R (which will slightly change the structure of the resulting data frame).

# read data straight form the web
gapdata <- read_csv("https://tinyurl.com/gapdata")

Data Exploration

We can explore the data by clicking on the table icon next to the variable name in RStudio “Environment” tab (top right pane), but that’s not a good practice because it won’t work with large data sets.
We can use built-in functions for a brief exploration, such as head() to show the first 10 rows of the data and str() or glimpse() for the type of data in each column:

#explore the data frame
head(gapdata) # show first 10 rows of the data and typr of variables
## # A tibble: 6 x 6
##   country      year      pop continent lifeExp gdpPercap
##   <chr>       <dbl>    <dbl> <chr>       <dbl>     <dbl>
## 1 Afghanistan  1952  8425333 Asia         28.8      779.
## 2 Afghanistan  1957  9240934 Asia         30.3      821.
## 3 Afghanistan  1962 10267083 Asia         32.0      853.
## 4 Afghanistan  1967 11537966 Asia         34.0      836.
## 5 Afghanistan  1972 13079460 Asia         36.1      740.
## 6 Afghanistan  1977 14880372 Asia         38.4      786.
str(gapdata) # show data structure
## spec_tbl_df [1,704 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year     : num [1:1704] 1952 1957 1962 1967 1972 ...
##  $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   country = col_character(),
##   ..   year = col_double(),
##   ..   pop = col_double(),
##   ..   continent = col_character(),
##   ..   lifeExp = col_double(),
##   ..   gdpPercap = col_double()
##   .. )
glimpse(gapdata) # show data structure using a tidyverse function
## Rows: 1,704
## Columns: 6
## $ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
## $ year      <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, ~
## $ pop       <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12~
## $ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi~
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8~
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, ~

Descriptive Statistics

We can also produce some descriptive statistics to better understand the data and the nature of each variable. The summary() function (as can be guessed by its name) provides a quick summary of basic descriptive statistics, such as the mean, min, max and quantiles for continuous numerical values.

# summary of variables in my data
summary(gapdata)
##    country               year           pop             continent        
##  Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
##  Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
##  Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
##                     Mean   :1980   Mean   :2.960e+07                     
##                     3rd Qu.:1993   3rd Qu.:1.959e+07                     
##                     Max.   :2007   Max.   :1.319e+09                     
##     lifeExp        gdpPercap       
##  Min.   :23.60   Min.   :   241.2  
##  1st Qu.:48.20   1st Qu.:  1202.1  
##  Median :60.71   Median :  3531.8  
##  Mean   :59.47   Mean   :  7215.3  
##  3rd Qu.:70.85   3rd Qu.:  9325.5  
##  Max.   :82.60   Max.   :113523.1

If we want to see how it can be used to summarise nominal (categorical) data, we can apply it on the same data, but that had been imported using the read.csv() function, that automatically converts text columns into factors, which are vectors of strings/characters with predefined categories (note that this is no longer the default behaviour in R versions >4.0).

# apply summary on data imported with read.csv(), what are the differences?
summary(read.csv("https://tinyurl.com/gapdata", stringsAsFactors = TRUE))
##         country          year           pop               continent  
##  Afghanistan:  12   Min.   :1952   Min.   :6.001e+04   Africa  :624  
##  Albania    :  12   1st Qu.:1966   1st Qu.:2.794e+06   Americas:300  
##  Algeria    :  12   Median :1980   Median :7.024e+06   Asia    :396  
##  Angola     :  12   Mean   :1980   Mean   :2.960e+07   Europe  :360  
##  Argentina  :  12   3rd Qu.:1993   3rd Qu.:1.959e+07   Oceania : 24  
##  Australia  :  12   Max.   :2007   Max.   :1.319e+09                 
##  (Other)    :1632                                                    
##     lifeExp        gdpPercap       
##  Min.   :23.60   Min.   :   241.2  
##  1st Qu.:48.20   1st Qu.:  1202.1  
##  Median :60.71   Median :  3531.8  
##  Mean   :59.47   Mean   :  7215.3  
##  3rd Qu.:70.85   3rd Qu.:  9325.5  
##  Max.   :82.60   Max.   :113523.1  
## 

We can see that when a variable is defined as a factor (contains categories), summary() will automatically summarise and count how many times wach category appears in the data.

We can also sort the data to look at extreme values. As opposed to using summary(), max() or min(), this way we see all the values in the other columns, so we can see not only the minimal recorded lifeExp, but also in which country and at which year it was observed.

# look at the lowest lifeExpt
gapdata %>% arrange(lifeExp)
## # A tibble: 1,704 x 6
##    country       year     pop continent lifeExp gdpPercap
##    <chr>        <dbl>   <dbl> <chr>       <dbl>     <dbl>
##  1 Rwanda        1992 7290203 Africa       23.6      737.
##  2 Afghanistan   1952 8425333 Asia         28.8      779.
##  3 Gambia        1952  284320 Africa       30        485.
##  4 Angola        1952 4232095 Africa       30.0     3521.
##  5 Sierra Leone  1952 2143249 Africa       30.3      880.
##  6 Afghanistan   1957 9240934 Asia         30.3      821.
##  7 Cambodia      1977 6978607 Asia         31.2      525.
##  8 Mozambique    1952 6446316 Africa       31.3      469.
##  9 Sierra Leone  1957 2295678 Africa       31.6     1004.
## 10 Burkina Faso  1952 4469979 Africa       32.0      543.
## # ... with 1,694 more rows
# Look at top rich countries
gapdata %>% arrange(desc(gdpPercap))
## # A tibble: 1,704 x 6
##    country    year     pop continent lifeExp gdpPercap
##    <chr>     <dbl>   <dbl> <chr>       <dbl>     <dbl>
##  1 Kuwait     1957  212846 Asia         58.0   113523.
##  2 Kuwait     1972  841934 Asia         67.7   109348.
##  3 Kuwait     1952  160000 Asia         55.6   108382.
##  4 Kuwait     1962  358266 Asia         60.5    95458.
##  5 Kuwait     1967  575003 Asia         64.6    80895.
##  6 Kuwait     1977 1140357 Asia         69.3    59265.
##  7 Norway     2007 4627926 Europe       80.2    49357.
##  8 Kuwait     2007 2505559 Asia         77.6    47307.
##  9 Singapore  2007 4553009 Asia         80.0    47143.
## 10 Norway     2002 4535591 Europe       79.0    44684.
## # ... with 1,694 more rows

Descriptive Statistics plotting

we can use several graph types to look at our data and assess its distribution, outliers, categories and more. A good reference to visualisation types can be found at From Data to Viz website.
What type of variables do we have in each column? Knowing that can help us determine how we can visualise each one.
Do find out, we can look again at the results of summary() to refresh our memory (or scroll up).

We have two character categorical variables, country and continent and 4 numerical variables (year, pop, lifeExp and gdpPercap). Are all the numerical variables similar? Do we treat year as a continuous numerical variable or as a nominal one? We need to decide based on our analysis.

Histogram

A histogram counts how many observations we have from each value and lets us assess how the data is distributed. When applied to categorical data, it shows us a graphic visualisation of the same output as running gapdata %>% dplyr::count(continent) (I’m using the double colon :: to tell R to use the count() function from dplyr package).

gapdata %>% dplyr::count(continent)
## # A tibble: 5 x 2
##   continent     n
##   <chr>     <int>
## 1 Africa      624
## 2 Americas    300
## 3 Asia        396
## 4 Europe      360
## 5 Oceania      24
ggplot(gapdata, aes(x=continent, fill=continent)) +
  geom_histogram(stat = "count", width=0.8) +
  scale_fill_brewer(palette = "Set1") + theme_bw(16)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Figure  3: Histogram of continent data.

Figure 3: Histogram of continent data.

The real power of a histogram though comes when we apply it to a continuous numerical variable. However, when the data is truly continuous, the range need to be divided to “bins” into which each observation falls and then the sum of observation in each bin is presented.

ggplot(gapdata, aes(x=lifeExp)) +
  geom_histogram(fill="lightskyblue") +
  theme_bw(16)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure  4: Histogram of life expectancy data.

Figure 4: Histogram of life expectancy data.

We can increase the number of bins for higher resolution, or decrease it for a more coarse look. We can also colour each continent to get a better understanding of the distribution of values from each continent.

ggplot(gapdata, aes(x=lifeExp, fill=continent)) +
  geom_histogram(bins = 100) +
  scale_fill_brewer(palette = "Set2") +
  theme_bw(16)
Figure  5: Histogram of life expectancy data by continent.

Figure 5: Histogram of life expectancy data by continent.

Can you try and plot a histogram of the years, what type of data is it based on the output?

Density Plots

Another variant of the histogram is a density plot, which is a “smoothed” version of the histogram. If we apply some transparency to the fill colour (using the alpha=0.5 argument), we can see where the data from the continents overlap (see Figure 6) .

ggplot(gapdata, aes(x=lifeExp, fill=continent)) +
  geom_density(alpha=0.5) +
  scale_fill_brewer(palette = "Dark2") +
  theme_bw(16)
Figure  6: Density plot of life expectancy data by continent.

Figure 6: Density plot of life expectancy data by continent.

Boxplot

A more detailed method to look at distribution of continuous numerical data is using a boxplot. These are visual representations of the descriptive statistics provided by summary() command. Each boxplot visualises five summary statistics (the median, two hinges and two whiskers), and all “outlying” points individually. The hinges (top and bottom of box) represent the range between the 25%-75% quantiles, while the length whiskers represent 1.5*IQR (interquantile range) from the hinges. Data points that fall outside this range are considered “outliers” and are plotted individually.

ggplot(gapdata, aes(x=continent, y=lifeExp,  colour=continent)) +
  geom_boxplot(width=0.25) +
  scale_colour_brewer(palette = "Set1") +
  theme_bw(16)
Figure  7: Box plots of life expectancy data by continent.

Figure 7: Box plots of life expectancy data by continent.

A box plot representation gives us a better tool to compare groups.

Violin Plots

We saw that the boxplot let us get a better understanding of how the data is distributed around the median, but it doesn’t give us the notion of the density of data observations that were collected along the range (like the histogram did).
A violin plot brings the best in both worlds and it is basically a vertical, 2-sided density plots that are plotted along the Y-axis (see Figure 8).

ggplot(gapdata, aes(x=continent, y=lifeExp,  fill=continent)) +
  geom_violin(draw_quantiles = c(0.2,0.8), trim = FALSE) +
  scale_fill_brewer(palette = "Dark2") +
  theme_bw(16)
Figure  8: Violin plots of life expectancy data by continent.

Figure 8: Violin plots of life expectancy data by continent.

Note that in R we can easily determine which quantiles we wish to plot. If we specify trim=TRUE, the violin plots will be trimmed and will not include the outliers.

We can then go ahead and combine violin with boxplots, along with a few adjustments, such as calculating and plotting the average value, to make the plot “publication” ready (Figure 9).

# calculate mean
mean_values <- gapdata %>% group_by(continent) %>% dplyr::summarise(lifeExp=mean(lifeExp))
ggplot(gapdata, aes(x=continent, y=lifeExp, colour=continent)) +
  geom_violin(aes(fill=continent), alpha=0.6) +
  geom_boxplot(width=0.15) + 
  geom_point(data = mean_values, size=2.5, pch=18) +
  scale_fill_brewer(palette = "Dark2") +
  scale_colour_brewer(palette = "Dark2") +
  labs(y="Life Expectancy", fill="Continent", colour="Continent", x="") +
  theme_bw(16)
Figure  9: Combined violin and box plots of life expectancy data by continent.

Figure 9: Combined violin and box plots of life expectancy data by continent.

Raincloud Plots

Another option to show the distribution of data points, along with some descriptive statistics, is by using “Raincloud Plots”, which can convey a lot of useful information. See details in Visualizing Distributions with Raincloud Plots using ggplot2 (blog post)

tweet_embed("https://twitter.com/CedScherer/status/1401604888900083713?s=20")

Interactive exploration

Sometimes it is very useful to be able to interact with the plot, to interrogate outliers and get the values in real time. We can use the plotly package for that on the boxplot we created earlier.

plot <- ggplot(gapdata, aes(x=continent, y=lifeExp,  colour=continent)) +
  geom_boxplot(width=0.25) +
  scale_colour_brewer(palette = "Set1") +
  labs(y="Life Expectancy", colour="Continent", x="") +
  theme_bw(16)
plotly::ggplotly(plot)

Figure 10: Interactive box plots of life expectancy data by continent (powered by Plotly).

Use your mouse to hover over the graph in Figure 10 to get details of outliers and summary statistics for each continent.
This is a great way to share plots between students/supervisors/collaborators to engage discussion on data interpretation and analysis and a tool to allow the public to interact with plots that are hosted online (ask me how!).

Downloading files from Binder

After we’ve finished working on Binder we can download the R script that we wrote and any output files (summary tables and figures). We can access those files by using the files tab in RStudio (bottom right pane).
Select the files/folders that you would like to download and click on More Export… (see screenshot in Figure 1 below) to save the file on your computer.

Figure  1: Download files from Binder/RStudio screenshot.

Figure 1: Download files from Binder/RStudio screenshot.

Additional Resources

  • Stats and R blog – Contains useful explanations and R code on statistics, including the very useful posts on Descriptive statistics by hand and Descriptive statistics in R
  • Statistical tools for high-throughput data analysis (STHDA) – An extremely useful website with all things statistics, data analysis and visualisation in R. For this tutorial, look at the Descriptive statistics and graphics page
  • From Data to Viz leads you to the most appropriate graph for your data – https://www.data-to-viz.com/
  • Coding Club: A Positive Peer-Learning Community (ecology and environmental science students and researchers from the University of Edinburgh) – https://ourcodingclub.github.io/
  • Beautiful plotting in R: A ggplot2 cheatsheet – provides lots of recipes and easy fixes for common tasks when creating ggplot2 plots (link)
  • The Evolution of a ggplot – a series of tutorials to create beautiful and informative plots in R (link)
tweet_embed("https://twitter.com/WeAreRLadies/status/1280859374668283906")
  • paletteer – An R package that lets you use colour palette from a range of available packages (see examples of available palettes here)
  • Plot anything with ggplot2 - YouTube workshops
  • How to write dynamic documents like this one? Learn RMarkdown using this “Definite Guide
  • Useful cheatsheets can be found on the RStudio Cheatsheets website (start with the most useful/basic ones relevant to this course - RStudio IDE, Data Import, Transformation and Visualization, RMarkdown and more)

Contact

Please contact me at for any questions or comments.